Remote direct memory access for iSCSI

ABSTRACT

A storage networking device provides remote direct memory access to its buffer memory, configured to store storage networking data. The storage networking device may be particularly adapted to transmit and receive iSCSI data, such as iSCSI input/output operations. The storage networking device comprises a controller and a buffer memory. The controller manages the receipt of storage networking data and buffer locational data. The storage networking data advantageously includes at least one command for at least partially controlling a device attached to a storage network. Advantageously, the storage networking data may be transmitted using a protocol adapted for the transmission of storage networking data, such as, for example, the iSCSI protocol. The buffer memory advantageously is configured to at least temporarily store at least part of the storage networking data at a location within the buffer memory that is based at least in part on the locational data.

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 60/469,556, which is entitled “Remote Direct Memory Access” which was filed Feb. 18, 2003. The foregoing provisional patent application is hereby incorporated by reference in its entirety. This application claims the benefit of U.S. patent application Ser. No. 10/781,341, now U.S. Pat. No. 7,512,663, which is entitled “Systems and Methods of Directly Placing Data in an ISCSI Storage Device” which was filed Feb. 18, 2004.

FIELD OF THE INVENTION

Aspects of the invention relate generally to remote direct memory access for iSCSI.

BACKGROUND

Recently, protocols have been developed for accessing data storage over networks. These protocols form the basis for new classes of network storage solutions wherein data is remotely stored and distributed within both storage area networks (SANs) and across larger public networks including the Internet. The iSCSI transport protocol standard defines one such approach for accessing and transporting data over commonly utilized Internet Protocol (IP) networks. Using the iSCSI command and instruction set, conventional Small Computer Systems Interface (SCSI) commands, typically associated with communication within locally maintained storage devices, may be encapsulated in a network-compatible protocol wrapper allowing SCSI communication between devices in a remote manner. The iSCSI protocol may further be used by a host computer system or device to perform block data input/output (I/O) operations with any of a variety of peripheral target devices. Examples of target devices may include data storage devices such as disk, tape, and optical storage devices, as well as printers, scanners, and other devices that may be networked to one another to exchange information.

In a SAN environment where storage devices are remotely accessible across a network, the block data operations associated with the iSCSI protocol may be structured so as to be compatible with the general manner of processing associated with existing storage devices. For example, disk drives may read and write using a fixed block size (e.g. 512-byte block). In contrast, computer applications may require access to a file of arbitrary length. One function of a computer file system is to change file-oriented requests associated with an application into block-level instructions that may be recognized and processed by the storage devices. Using the iSCSI protocol, such application requests may be processed by the file system to generate storage device compatible instructions thereby permitting storage and retrieval of information.
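By way of illustration only, the translation of a file-oriented request into block-level terms can be sketched as follows. This is a minimal sketch assuming the 512-byte block size of the example above; the function name `blocks_for_request` and its parameters are hypothetical and do not appear in the disclosure.

```python
BLOCK_SIZE = 512  # bytes per block, taken from the 512-byte example above


def blocks_for_request(offset, length, block_size=BLOCK_SIZE):
    """Return the block numbers covering bytes [offset, offset + length).

    Hypothetical helper: a file system performs a mapping of this kind
    when it turns an arbitrary-length request into whole-block operations.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


# A 100-byte read starting at byte 1000 touches blocks 1 and 2.
print(list(blocks_for_request(1000, 100)))
```

A request that is not aligned to a block boundary touches every block that it overlaps, so the storage device still sees only whole-block operations.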

From an application or software perspective, an iSCSI device generally appears as a locally-attached SCSI device. As with the standard SCSI protocol, iSCSI information exchange is based on communication between an initiator device and a target device (e.g. client/server model). An iSCSI device that requests a connection to the target device and issues an initial series of SCSI commands is referred to as the initiator. An iSCSI device that completes the connection to the initiator and receives the initial SCSI commands is referred to as the target. One function of the initiator is to generate SCSI commands (e.g. data storage and access requests) that are passed through an iSCSI conversion layer where the SCSI commands are encapsulated as iSCSI protocol data units (PDUs). Thereafter, the iSCSI PDUs may be distributed across a network to the target device where the underlying SCSI instructions and data are extracted and processed. In a similar manner, the target device may transmit data and information prepared using the SCSI responses and encapsulated as iSCSI PDUs to be returned to the initiator.

SUMMARY OF THE INVENTION

Embodiments of the invention described herein provide remote direct memory access to buffer memory, located within a storage networking device, for storing storage networking data. Advantageously, providing direct memory access to buffer memory reduces processing resources dedicated to performing input and output operations to buffer memory, thus increasing efficiency of transmissions between storage networking devices. As used herein, storage networking data includes all types of data that can be stored using a storage network, including, for example, commands for controlling storage networking devices. A storage networking device, according to one embodiment, is particularly adapted to transmit and receive iSCSI data, such as iSCSI input/output operations. In one embodiment, the storage networking device is configured to transmit data using other protocols, either in place of iSCSI data or in addition to iSCSI data.

In one embodiment, the storage networking device comprises a controller and a buffer memory. The controller manages the receipt of storage networking data and buffer locational data. According to an embodiment, the storage networking data includes at least one command for at least partially controlling a device attached to a storage network. Advantageously, the storage networking data may be transmitted using a protocol adapted for the transmission of storage networking data, such as, for example, the iSCSI protocol. The buffer memory, in one embodiment, is configured to at least temporarily store at least part of the storage networking data at a location within the buffer memory that is based at least in part on the locational data. Advantageously, this allows the storage networking device to provide direct access to the destination buffer memory without going through intermediate buffering and multiple memory copying.

Advantageously, embodiments described herein can be used either to transmit storage networking data from an initiator device to a target device or to transmit storage networking data from a target device to an initiator device.

In one embodiment, the storage networking device can transmit a pointer to a location in buffer memory within the locational data. This allows the storage networking device to extract the pointer from the locational data in order to determine where within the buffer memory the received storage networking data is to be stored. Alternatively or additionally, the storage networking device can transmit an index to a data pointer table within the locational data. In such an embodiment, the index can be extracted and used to refer to a data pointer table and generate a pointer to a location within the buffer memory. In an embodiment, the index can be encrypted for added security.

In an embodiment, the storage networking device is configured to transmit the information upon which the locational data is based to the remote storage networking device. This embodiment allows the storage networking device to assign a location within buffer memory that is to be used to store received data. In an embodiment, the storage networking device transmits the information upon which the locational data is based to the remote storage networking device in a packet that indicates that the storage networking device is ready to receive data from the remote storage networking device.

Advantageously, the storage networking device can also include a connection lookup table. The connection lookup table, in one embodiment, defines a plurality of connections between the storage networking device and one or more remote storage networking devices. In one embodiment, the locational data, in addition to specifying a location in buffer memory at which storage networking data is to be stored, also identifies one of the connections in the connection lookup table. In an advantageous embodiment, the locational data is used to verify that data received by the storage networking device comes from a recognized connection. This advantageously enhances security of any data transmission to the storage networking device.
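By way of illustration only, the verification against a connection lookup table can be sketched as follows. This is a minimal sketch under stated assumptions: the table layout, the `Connection` fields, the example addresses, and the function name `is_recognized` are all hypothetical, and the disclosure does not prescribe how a connection is identified within the locational data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """Hypothetical description of one connection to a remote device."""
    remote_addr: str
    remote_port: int


# Hypothetical connection lookup table: connection identifier -> connection.
connection_table = {
    1: Connection("192.0.2.10", 49152),
    2: Connection("192.0.2.11", 49153),
}


def is_recognized(connection_id, sender, table):
    """Return True only if the identifier carried in the locational data
    names a known connection and the data arrived on that connection."""
    expected = table.get(connection_id)
    return expected is not None and expected == sender


# Data that names connection 1 but arrives from another peer is rejected.
print(is_recognized(1, Connection("192.0.2.10", 49152), connection_table))
print(is_recognized(1, Connection("192.0.2.99", 49152), connection_table))
```

In this sketch a failed check simply returns False; what the device does with unrecognized data is left open.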

Advantageously, each of the foregoing embodiments can also have iSCSI acceleration hardware that accelerates the processing of iSCSI communications received by the storage networking device.

Embodiments of the storage networking device perform the following method of storing data in a directly accessible buffer memory of a storage networking device: (1) Storage networking data and first locational data are received over a network from a remote storage networking device. (2) A location within the buffer memory is determined based at least in part on the first locational data. (3) The storage networking data is stored within the buffer memory at the location determined at least in part by the first locational data. In one embodiment, the storage networking data includes at least one command for at least partially controlling a device attached to a storage network and is transmitted using a protocol adapted for the transmission of storage networking data, such as iSCSI. In one embodiment, the first locational data is configured to specify at least indirectly a location within a buffer memory of a storage networking device.

In one embodiment, the storage networking device also transmits second locational data to a remote storage networking device. In one embodiment, the first locational data is substantially the same as the second locational data, such that the storage networking device assigns the location within buffer memory at which the storage networking data is stored.

With regard to embodiments of the foregoing method, the location may be determined by generating a pointer into the buffer memory from the first locational data. Generating the pointer may include extracting the pointer from the first locational data, or may include extracting an index into a data pointer table from a part of the first locational data and using the index to extract the pointer from the data pointer table. Advantageously, the part of the first locational data comprising an index may be encrypted within the first locational data.
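By way of illustration only, the two ways of generating a pointer, followed by the storing step, can be sketched as follows. This is a minimal sketch under stated assumptions: the buffer size, the contents of the data pointer table, the placement of the index in the low 16 bits of the locational data, and all function names are hypothetical. Encryption of the index is omitted.

```python
buffer_memory = bytearray(64)      # hypothetical 64-byte buffer memory
data_pointer_table = [0, 16, 32]   # hypothetical table: index -> buffer offset


def pointer_from_direct(locational_data):
    """Variant 1: the locational data itself carries the pointer."""
    return locational_data


def pointer_from_index(locational_data, table):
    """Variant 2: part of the locational data is an index into the table.

    Assumption for this sketch: the index occupies the low 16 bits.
    """
    index = locational_data & 0xFFFF
    return table[index]


def store(buffer, offset, payload):
    """Place the payload directly at the resolved location."""
    end = offset + len(payload)
    if offset < 0 or end > len(buffer):
        raise ValueError("payload does not fit at the resolved location")
    buffer[offset:end] = payload


# Locational data 0x00AB0002 carries index 2, which resolves to offset 32.
offset = pointer_from_index(0x00AB0002, data_pointer_table)
store(buffer_memory, offset, b"DATA")
print(offset, bytes(buffer_memory[32:36]))
```

The indexed variant keeps actual buffer addresses private to the receiving device, since only an index travels over the network.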

A skilled artisan will appreciate, in light of this disclosure, that many variations exist of embodiments that are covered by the principles of the invention disclosed herein, many of which are described explicitly herein. Described embodiments are disclosed herein by way of example and not of limitation. A skilled artisan will appreciate, in light of this disclosure, how to practice the invention according to still other embodiments not explicitly described herein. The invention covers all such embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features will now be described with reference to the drawings summarized below. These drawings and the associated description are provided to illustrate one or more preferred embodiments of the invention, and not to limit the scope of the invention.

FIG. 1 illustrates an exemplary communications network comprising a plurality of iSCSI-enabled devices.

FIG. 2 illustrates a high level block diagram of an iSCSI hardware solution for providing iSCSI processing functionality.

FIG. 3 illustrates another high level block diagram of the iSCSI hardware solution for providing iSCSI processing functionality.

FIG. 4 illustrates a modified iSNP Open System Interconnect (OSI) model for network communications.

FIG. 5A illustrates an exemplary iSCSI command PDU.

FIG. 5B illustrates an exemplary iSCSI jumbo frame data PDU.

FIG. 5C illustrates an exemplary iSCSI standard frame data PDU.

FIG. 6 illustrates a connection between a network and I/O devices.

FIG. 7 illustrates an example iSCSI READ sequence.

FIG. 8 illustrates an example iSCSI WRITE sequence.

FIG. 9A illustrates three segments of buffer memory.

FIG. 9B illustrates sample R2T PDUs.

FIG. 9C illustrates a data pointer table.

FIG. 9D illustrates sample Data Out PDUs.

DETAILED DESCRIPTION

Although certain preferred embodiments and examples are disclosed below, it will be understood by those of ordinary skill in the art that the invention extends beyond the specifically disclosed embodiments to other alternative embodiments and uses of the invention and obvious modifications and equivalents thereof. Thus, it is intended that the scope of the claimed invention should not be limited by the particular embodiments described below.

The details and specification for the iSCSI protocol standard have been detailed by a working group under the Internet Engineering Task Force (IETF). At the time of this writing, Internet Draft draft-ietf-ips-iscsi-20.txt is the most recent version of the iSCSI protocol, and is hereby incorporated by reference in its entirety. Nothing in this disclosure, including the claims, is intended to limit the invention to any particular implementation of the iSCSI protocol or the TCP/IP protocol, and the terms “iSCSI” and “TCP/IP” as used herein encompass present and future developments to the iSCSI and TCP/IP protocols, respectively.

In one aspect, the present teachings describe a remotely-accessible storage architecture that may be adapted for use with networks which implement packetized information exchange using, for example, Transmission Control Protocol/Internet Protocol (TCP/IP) connectivity. The storage architecture comprises an Internet Protocol Storage Area Network (IP-SAN) that may serve as a replacement for Fibre Channel Storage Area Networks (FC-SAN) as well as other conventional network attached storage (NAS) and direct storage solutions.

Improvements in transmission efficiency and data throughput compared to many conventional NAS storage environments may be realized using specialized processing of iSCSI commands and information. These iSCSI processing functionalities may be desirably implemented using conventional network infrastructures without significant alterations or upgrades. For example, it is conceived that present teachings may be used in connection with conventional Ethernet configurations wherein routers and switches direct the flow of information throughout the network. One desirable benefit realized when using such an implementation is that a relatively low cost/high performance network storage environment can be created based on an existing network without the need to perform substantial costly network upgrades.

The use of dedicated Fibre Channel lines and specialized Fibre Channel hardware is also not necessary to gain the benefit of high throughput network storage. It will be appreciated, however, that the systems and methods described herein may be readily adapted for use with numerous different types of networking technologies, including Fibre Channel-based technologies, to help improve performance and reliability in network storage and data distribution. It will further be appreciated that the present teachings may be adapted for use in networks containing mixed technologies such as partial Fibre Channel/partial conventional Ethernet.

One significant drawback to conventional methods for iSCSI command processing and data encapsulation is that such processing generally takes place at the software/application level. This presents a problem in that storage servers and their associated storage devices often bear the burden of encapsulation and resolution of SCSI commands. This may place a heavy computational burden upon the file system and CPU of both the initiator and target. In bandwidth intensive applications, where numerous storage requests/accesses are made, target storage devices equipped with such software-limited solutions typically encounter diminished throughput and bottlenecks in data transfer. In one aspect, the present teachings may advantageously be used to overcome the limitations of software/file system iSCSI information processing through the use of iSCSI hardware acceleration devices. As will be described in greater detail herein below, hardware acceleration desirably provides a means to reduce the computational burden of iSCSI packet processing by offloading this task from the file system and CPU onto an iSCSI protocol processing board or storage server blade. Use of the storage server blade may improve performance through dedicated hardware circuitry that is designed to more efficiently handle the tasks associated with iSCSI processing as compared to conventional methods. Furthermore, use of a separate hardware-based processing device removes at least a degree of computational overhead from the file system and CPU to permit other storage-related and unrelated computational tasks to be more efficiently performed.

FIG. 1 illustrates an exemplary communications system for remote information storage and retrieval comprising several iSCSI devices that communicate over a network 100. In the illustrated embodiment, application servers 104 possess suitable network functionality to exchange information with other network devices including switches and routers. In one aspect, the application servers 104 comprise computers or other devices which access informational resources contained within a storage server 106. The storage server 106 comprises a server blade 108 and at least one storage device 110. In one aspect, the server blade 108 comprises a hardware device that provides network connectivity for the at least one storage device 110 and further communicates with the switches and routers used in the network 100.

In various embodiments, the network infrastructure which interconnects the application servers 104 to the server blade 108 comprises gigabit Ethernet connectivity with suitable gigabit Ethernet switches and routers. Although FIG. 1 is illustrated as possessing gigabit Ethernet functionality, it will be appreciated that other network classes such as wide-area networks (WANs), private networks, or the Internet may serve as a suitable network infrastructure. Likewise, the hardware components and devices described in connection with the present teachings may be adapted for use with these and other network configurations including conventional wired networks and optical networks (e.g. Fibre Channel).

Each application server 104 uses a host bus adaptor (HBA) 114 that provides a means for network communication between the application servers 104 and the network 100. Each application server 104 may further be connected directly to the storage server 106 such that no switches or routers are necessary to exchange information in the storage network. Additionally, multiple application servers may communicate with a single storage server, and a single application server may communicate with multiple storage servers. In one embodiment, link aggregation, such as that defined by the IEEE 802.3ad specification, may be used to allow for multiple connections between an application server and a storage server. For a review of link aggregation methodologies and the associated 802.3ad specification the reader is referred to IEEE Std 802.3, 2000 Edition. Section 43 of this standard corresponds to IEEE 802.3ad and is hereby incorporated by reference in its entirety.

Each application server 104 transmits requests for stored resources located on the storage devices 110. As will be described in greater detail herein below, informational requests may take the form of iSCSI PDUs that are sent from the application server 104 to the storage server 106. Furthermore, the HBA 114 of each application server 104 may perform operations associated with forming an appropriate connection to the storage server 106 and encapsulating application server SCSI commands as iSCSI instructions. These iSCSI instructions are received by the server blade 108 wherein they may be decoded and the requested operations associated with the storage devices 110 performed. In a similar manner, the server blade 108 may encapsulate SCSI commands and storage device information as iSCSI instructions to be returned to the application server 104 for processing.

The server blade 108 may also be configured to provide other desirable functionalities such as high availability features which implement backup and failover provisions. In one aspect, the server blade 108 may further serve as a controller for a redundant array of independent disks (RAID) to provide mirroring, redundancy, and backup functionalities through the use of a plurality of storage devices 110 interconnected to the server blade 108. Additionally, two or more server blades may operate in a coordinated manner to provide additional high availability functionalities as well as load balancing and distribution functionalities. Another feature of the server blade 108 is that it may be designed to be compatible with conventional iSCSI HBAs such that existing application servers 104 which already possess an iSCSI enabled HBA may not require replacement to operate with the storage server 106 of the present teachings.

In various embodiments, a management console 112 may further connect to the network 100. The management console 112 may be associated with an application server 104 or other computer or software-based application that is able to remotely perform administrative functions within the storage server 106 and/or various application servers 104 located throughout the network 100. In one aspect, the management console 112 may be used to provide software updates and/or firmware revisions to the server blade 108 or storage devices 110 of the storage server 106. Use of the management console 112 also provides a means to remotely view and modify the operational parameters of the storage server 106 in a convenient manner.

FIG. 2 illustrates a high level block diagram of an iSCSI hardware solution 120 that provides iSCSI processing functionality for the server blade 108 and/or application server HBAs 114. In one aspect, a storage network processor (iSNP) 122 is interconnected with an internal CPU 124. The iSNP 122 and CPU 124 are principally responsible for the processing of iSCSI instructions and data. In one aspect, the iSNP 122 and CPU 124 provide the necessary functionality to encapsulate/decapsulate SCSI instructions and data independent of the rest of the system thereby offloading iSCSI communication overhead.

A memory area 126 is further associated with the iSNP 122 wherein the memory provides buffering functionality for iSNP 122. The memory area 126 may further be subdivided into separate areas including a system memory area 128 and a buffer memory area 130. The system memory 128 may be used to cache iSCSI instructions and commands associated with sending and receiving data throughout the storage network 100. In a similar manner, the data buffer 130 may be used to cache data and information that is transported throughout the storage network 100.

The iSNP 122 is further associated with a storage device controller 132. The storage device controller 132 represents a hardware interface between the server blade 108 and the storage devices 110. The storage device controller 132 may be a conventional controller (e.g. a conventional ATA, serial ATA or SAS controller) or may be a dedicated design that is integrated into the server blade 108. In various embodiments, a compatible bus 134 may provide a means for communication between the iSNP 122 and the storage device controller 132. Furthermore, one or more storage device controllers 132 may be associated with a single iSNP 122 to provide accessibility to multiple storage devices 110 through one or more buses. Each bus 134 may further adhere to a conventionally used communications standard such as a peripheral component interconnect (PCI) bus, a PCI-X bus, or PCI-Express bus.

The iSNP 122 is further associated with a suitable network interface 136 to provide a means for communicating across the network 100. In one aspect, the network interface 136 transmits and receives iSCSI PDUs and acts as an interconnect between the iSNP 122 and other devices present in the network 100. The network interface 136 may comprise a single interface 138 or an aggregated interface 140 which uses any of a number of different networking implementations. As will be described in greater detail herein below, the network interface 136 may comprise an XGMII/XAUI interface which allows interconnection between a Media Access Control (MAC) sublayer of the iSNP 122 and a Physical layer (PHY) of the 10 gigabit Ethernet network. Additionally, the network interface 136 may comprise a GMII/MII or TBI/SerDes interface for interconnecting to a 1000 based network, a 100/10 based network or other network type. It will be appreciated that numerous different interface specifications exist for the purpose of providing network connectivity; as such, it is conceived that any of these interfaces may be configured to operate with the iSNP 122 without departing from the scope of the present teachings.

It will be appreciated that the principal components of the iSCSI hardware solution 120 may differ somewhat between those used in the server blade 108 and those used in the application server HBAs 114. For example, the server blade 108 may be configured to accommodate higher bandwidth by providing an increased iSNP 122 processor speed, additional memory 126, multiple controllers 132, and/or higher capacity network interfaces 136. Furthermore, HBA-associated iSCSI hardware solutions may lack certain components that are not required in iSCSI communication such as the controller 132 if no storage devices directly interconnect to the associated device.

FIG. 3 illustrates another high level block diagram of the iSCSI hardware solution 120 for providing iSCSI processing functionality. In various embodiments, the iSCSI hardware solution 120 is implemented as an application-specific integrated circuit (ASIC) and may support a very long instruction word (VLIW) architecture for one or more of the subcomponents contained therein. In one aspect, the iSNP 122 may be further divided into a series of modules forming a protocol intercept engine (PIE) 142. The PIE subsystem 142 comprises a receive (Rx) module 144, a transmit (Tx) module 146, and an acknowledgment/windowing/and retransmit (AWR) module 148. As will be described in greater detail herein below, the modules of the PIE subsystem 142 are responsible for providing hardware accelerated storage/retrieval functionality and further offload iSCSI protocol processing from the software and/or file system of the storage server 106 and the application servers 104. In one aspect, the PIE subsystem 142 achieves a high level of computational performance through using one or more dedicated processors (e.g. RISC processors) and may further incorporate one or more hardwired state machines to perform tasks associated with iSCSI instruction and command processing.

The PIE subsystem 142 communicates with other components of the iSCSI hardware solution 120 through an internal system bus 150. The CPU 124, shown in FIG. 3 as a CPU complex, may be formed as a collection of one or more processors that may be clustered in a hierarchical structure. In one aspect, CPU clustering desirably provides a means to selectively expose certain processors to the system bus 150 and may allow clustered processors to operate in a substantially independent manner thereby improving performance and iSCSI instruction processing capabilities.

A storage controller interface 151 may also be interconnected to the system bus 150 wherein the storage controller interface 151 provides a bridge to the storage controller bus 134. In one aspect, the storage controller interface 151 comprises at least one external bus (e.g. PCI, PCI-X, PCI-Express, etc.) that interconnects to the controller 132 and optionally a management bus which may serve as an MPU interface.

In various embodiments, two or more iSCSI hardware solutions 120 may be desirably interconnected to provide improved high-availability capabilities, failover, load-balancing, and redundancy features. Furthermore, near zero latency performance can be achieved thereby reducing performance bottlenecks and increasing overall throughput. Interconnection in this manner is accomplished through a coordinated system memory controller 152, reflective memory controller 154, and a buffer memory controller 156 which are linked in a peer to peer manner between each interconnected iSCSI hardware solution 120. Together these components 152, 154, 156 are responsible for communicating and coordinating the activities of each hardware solution 120 with respect to one another. These components also provide advanced error correction and data recovery functionality that aid in improving performance in high-throughput applications.

A structured memory accelerator 158 may also be integrated into the iSCSI hardware solution 120 to provide advanced queuing of instructions and commands. In one aspect, the structured memory accelerator 158 interacts with the buffer memory 126 to improve performance during instruction retrieval by the PIE subsystem 142 and/or the CPU 124. Using the structured memory accelerator 158 further offloads the burden of software command processing by providing an efficient hardware solution for command processing when appropriate. A desirable feature of the structured memory accelerator 158 is that it is capable of handling and managing a large number of queues simultaneously to enhance system performance. This feature further provides the ability to manage a large number of incoming network connections such that incoming requests from many devices may be resolved and serviced without excessive load penalties.

As previously indicated, the PIE subsystem 142 provides a number of significant features and functionalities related to the processing of iSCSI PDUs. The iSNP/PIE components process and pass information through a layered storage networking stack wherein one or more of the following functions are performed by the iSNP/PIE components in selected layers. Informational flow throughout the layers may be described in terms of a protocol data unit (PDU) that defines a generic unit of transfer between selected protocol layers. The PDU may contain control information, commands, instructions, as well as data and information relating to the desired send/receive requests.

Some exemplary functions generally associated with the iSNP/PIE components include:

Segmentation and Reassembly: A PDU may be subdivided into smaller PDUs for reasons relating to transfer limitations, improved quality of service, desirable buffer size, or error control functionality. Likewise, segmented PDUs may be assembled into larger PDUs for similar reasons.

Encapsulation: Control information may be included in the PDU that specifies information including source and destination addresses, status, and/or other options. The control information associated with the PDU may include information associated with different aspects of the protocols in use within the network. For example, a PDU may be simultaneously associated with SCSI control information, iSCSI control information, and other network control information, each of which is incorporated into the PDU for subsequent processing during network communication.

Connection control: PDUs may be transferred between devices without prior coordination (e.g. connectionless transfer/broadcast). It is generally desirable, however, for devices within the network to establish a logical association or network connection between one another prior to data exchange. In one aspect, establishment of the network communication proceeds according to handshaking or connection rules. Once the connection has been established, data may be transferred between the devices in a coordinated manner. Coordination may take the form of sequentially numbering or organizing PDUs to be transferred to facilitate sequencing of the data, providing a means for ordered delivery, flow control, and error recovery.

Flow control: Overflow in the device receiving stored data and information is substantially prevented by sending PDU sequence information describing a selected range of information requested from the sending device. Thereafter, the sending device transmits the PDUs within the selected range and refrains from transmitting PDUs outside of the selected range. Once the receiving device has received all of the PDUs in the selected range, the receiving device then indicates a next range of selected PDUs to be transmitted.

Error control: To recover lost or corrupted PDUs, sequence numbers may be used as a reference and positive acknowledgement of valid PDUs may be performed by the receiving device. Furthermore, when a PDU remains unacknowledged for a selected amount of time, the PDU may be retransmitted by the sending device to thereby account for a PDU which may be lost or partially received.
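
By way of illustration only, the segmentation and reassembly function listed above, together with the sequence numbering mentioned under connection control, might be sketched as follows. The function names `segment` and `reassemble` and the use of a simple integer sequence number are assumptions made for this sketch and are not taken from the described embodiments.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[int, bytes]  # (sequence number, chunk); hypothetical layout


def segment(payload: bytes, max_size: int) -> List[Segment]:
    """Split a payload into numbered chunks of at most max_size bytes."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return [
        (seq, payload[off:off + max_size])
        for seq, off in enumerate(range(0, len(payload), max_size))
    ]


def reassemble(segments: Iterable[Segment]) -> bytes:
    """Rebuild the payload from chunks that may arrive out of order."""
    ordered = sorted(segments, key=lambda s: s[0])
    return b"".join(chunk for _, chunk in ordered)
```

Because each chunk carries its own sequence number, a receiver in this sketch can restore the original order regardless of arrival order.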

FIG. 4 illustrates a modified iSNP Open System Interconnect (OSI) model 160 for network communications wherein the iSNP/PIE components perform operations associated with SCSI, iSCSI, and network command information processing. According to the model 160, storage networking and communication generally follow a layered, or hierarchical, approach wherein a plurality of layers exist to perform selected functions related to the processing of information.

The principal layers of the iSNP model 160 include a SCSI layer 162, an iSCSI layer 164, a TCP layer 166, an IP layer 168, an Ethernet layer 170, and a physical layer 172. The SCSI layer 162 implements the SCSI command set wherein storage block data operations (e.g. input/output) to SCSI devices are performed and managed. The iSCSI layer 164 transports SCSI I/O over an IP network through the use of iSCSI protocol data units. Within the iSCSI layer 164, storage device write commands are processed and made ready for transmission.

The TCP layer 166 serves as the principal end-to-end network protocol and is typically used for establishing a reliable (connection-oriented) session between the sending and receiving devices. Each of the PDUs to be transmitted may include a TCP segment header that is used to orient and arrange each of the PDUs with respect to one another when received.

The IP layer 168 serves as a connectionless service that is typically used to route data and information between network devices. Each of the PDUs to be transmitted may include an IP packet header that contains the source address, destination address, and ID information. Segmentation of the PDUs may occur at this layer wherein each PDU is broken into a suitable size/number of IP fragments to accommodate a maximum transfer unit (MTU) of the network. On the receiving side, segmented packets may be reassembled at the IP layer 168 prior to passing up to the TCP layer 166.

The Ethernet layer 170 serves as the media access control (MAC) protocol handler to transfer Ethernet frames across the physical link (e.g. physical network connection/layer). In one aspect, each PDU contains a MAC address that serves as a universal vendor-specific address that is pre-assigned for each network device.

The physical layer 172 defines a physical medium (e.g. physical cable or connection type) and provides the electrical and mechanical means to maintain the physical link between systems.

From the perspective of the iSNP 122, processing for the SCSI layer 162 and part of the iSCSI layer 164 generally occurs at the software level, whereas processing for the remainder of the iSCSI layer 164, the TCP layer 166, the IP layer 168, and the Ethernet layer 170 occurs using software or hardware acceleration. In one aspect, hardware acceleration desirably improves data transmission performance and provides a means to rapidly transmit storage data and information in a more efficient manner as compared to conventional network storage solutions.

FIG. 5A illustrates an exemplary iSCSI command PDU segment 232 that incorporates the header information for various layers of the iSNP storage network stack 160. The packet header is organized such that the headers are interpreted according to the order of the stack 160. For example, the iSCSI command PDU segment 232 comprises an Ethernet header 234, an IP header 236, a TCP header 238, and an iSCSI header 240, each of which is arranged substantially adjacent to the others. One or more of the headers 234, 236, 238, 240 may further include checksum or error correction information 582 that may be used to verify the received information during the various stages of decoding and resolution to ensure integrity in the transfer of data and command information.

When processing the iSCSI command PDU segment 232, the Ethernet layer 170 of the receiving device first decodes/interprets the Ethernet header 234 and passes the remaining contents of the PDU to the next higher layer, which in the illustrated example is the IP layer 168. Subsequently, the IP header 236 is decoded/interpreted and the remaining contents of the PDU are passed to the next higher layer. The above-described manner of processing proceeds sequentially for each header portion, continuing through the decoding of the iSCSI header 240. Thereafter, an underlying SCSI command 242 is resolved and may be executed by the receiving device to accomplish tasks associated with storage and retrieval of information.

FIG. 5B illustrates an exemplary iSCSI jumbo frame data PDU 244 containing a data or information segment 246 associated with the aforementioned iSCSI command header information. In one aspect, the data segment 246 is preceded by the iSCSI command PDU segment 232, which carries the necessary information for the data segment 246 to be interpreted by the receiving device. One desirable benefit of the storage network of the present teachings is that relatively large data frames are supported, which may be useful in improving data transmission efficiency between devices. Additionally, only a single iSCSI command PDU segment 232 need be associated with the data segment 246, thereby reducing the amount of information which must be transmitted between devices. In the illustrated example, a data segment 246 of 2048 words or 8192 bytes is shown; however, it will be appreciated that other data segment sizes can be readily supported.

FIG. 5C illustrates an exemplary iSCSI standard frame data PDU 248 comprising a plurality of sub-segments 252, each of which is associated with an initial iSCSI command PDU segment 232. In the exemplary iSCSI standard frame data PDU 248, each sub-segment comprises separate header information 254 which is transmitted with the sub-segments 252. During the receipt and decoding of the iSCSI standard frame data PDU 248, the sub-segments are re-associated to join the information stored therein. The use of separate header information 254 in the aforementioned manner allows smaller frames to be transmitted and later re-associated, and provides a means to recover or retransmit smaller portions of data as compared to the jumbo frame PDU 244.

Remote Direct Memory Access

FIG. 6 illustrates a simplified block diagram for communicating data between a network 100 and storage I/O devices 110. Buffer memory 130 provides an interface between the stream-oriented network 100 and the block-oriented storage I/O devices 110. Data arriving from the network 100 is assembled in buffer memory 130. The storage I/O devices 110 can then access the assembled data as blocks in buffer memory 130. Similarly, the storage I/O devices 110 can transfer blocks of data to buffer memory 130 for transmission over the network 100. The network has direct access to the buffer memory 130 through a buffer memory controller 156 (see FIG. 3) within the iSCSI controller 120. Similarly, the storage I/O devices 110 have direct access to buffer memory 130 using, for example, a PCI-X to Serial ATA Interface 132 and a storage controller interface 151 (see FIG. 3) within the iSCSI controller 120. The specific embodiment illustrated in FIG. 6 is exemplary only, and other types of connections and interfaces may also be appropriate for communicating with the buffer memory 130.

In one embodiment, at least a portion of the data communicated over the network 100 conforms to the iSCSI protocol. Referring to FIG. 6, the iSCSI controller 120 receives iSCSI PDUs from the network 100, and transmits iSCSI PDUs to the network 100. The iSCSI protocol is a mapping of the SCSI remote procedure invocation model over the TCP protocol. At the highest level, SCSI is a family of interfaces for requesting services from I/O devices, such as hard drives, tape drives, CD and DVD drives, printers, and scanners.

A SCSI target device may contain one or more SCSI target ports. iSCSI PDUs that are directed to a specific port need a means to identify that port. This is accomplished in the iSCSI protocol by providing header segments that comprise various control fields. For example, the basic header segment in some PDUs comprises a Logical Unit Number (LUN) field. The LUN field is 64 bits and is formatted in accordance with the SCSI standards.

The basic header segment also comprises an opcode field. The opcode indicates the type of iSCSI PDU corresponding to the basic header segment. The opcodes are divided into two categories: initiator opcodes, which are sent by the initiator, and target opcodes, which are sent by the target. Example initiator opcodes include Login request, SCSI command, SCSI Data-out (for WRITE operations), and Logout request. Example target opcodes include Login response, SCSI response, SCSI Data-in (for READ operations), Ready to Transfer (R2T), and Logout response.
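
As a non-limiting sketch, the division of opcodes into initiator and target categories might be represented as shown below. The numeric opcode values and the 6-bit opcode mask are taken from the published iSCSI specification (RFC 3720) rather than from this description, and the function name `classify_opcode` is hypothetical.

```python
from typing import Tuple

# Opcode values per RFC 3720 (an assumption of this sketch; this
# description names the PDU types but does not list numeric values).
INITIATOR_OPCODES = {
    0x01: "SCSI Command",
    0x03: "Login Request",
    0x05: "SCSI Data-Out",
    0x06: "Logout Request",
}
TARGET_OPCODES = {
    0x21: "SCSI Response",
    0x23: "Login Response",
    0x25: "SCSI Data-In",
    0x26: "Logout Response",
    0x31: "Ready To Transfer (R2T)",
}


def classify_opcode(first_bhs_byte: int) -> Tuple[str, str]:
    """Return (sender category, PDU type) for byte 0 of a basic header segment."""
    opcode = first_bhs_byte & 0x3F  # low 6 bits carry the opcode
    if opcode in INITIATOR_OPCODES:
        return ("initiator", INITIATOR_OPCODES[opcode])
    if opcode in TARGET_OPCODES:
        return ("target", TARGET_OPCODES[opcode])
    return ("unknown", hex(opcode))
```

Only the PDU types named in the preceding paragraph are listed; a complete implementation would cover the remaining opcodes defined by the specification.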

For the particular case when the opcode is a SCSI command, the SCSI command PDU includes a SCSI Command Descriptor Block (CDB). Command Descriptor Blocks contain the command parameters that an initiator sends to a target. The CDB content and structure correspond to the SCSI standard. For example, the initiator could request that the target transfer data to the initiator by sending a SCSI command PDU with a CDB that corresponds to the SCSI READ command.

FIG. 7 illustrates an example of the use of the iSCSI protocol to perform a READ operation. As shown at step 620, the initiator communicates a “READ” SCSI command PDU to the target. The “READ” command PDU comprises fields that contain the following information:

-   LUN: The logical unit number of the SCSI device to read from;
-   Expected Data Transfer Length: The number of bytes of data the target is expected to transfer to the initiator; and
-   SCSI Command Descriptor Block (CDB): In this example, the CDB corresponds to the SCSI command descriptor block for the READ command.

A SCSI CDB typically comprises an Operation Code field, a SCSI Logical Unit Number field, and a Logical Block Address field. The CDB may additionally comprise a Transfer Length field, a Parameter List Length field, and an Allocation Length field. The SCSI CDB for the READ command requests that the target transfer data to the initiator. Within the READ CDB, the Logical Block Address field specifies the logical block at which the read operation begins. The Transfer Length field specifies the number of contiguous logical blocks of data to be transferred. Using the Logical Block Address field and the Transfer Length field, the target knows where to begin reading data and the size of the data segment to read. At step 622, the target prepares the data for transfer.
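
For purposes of illustration, the extraction of the Logical Block Address and Transfer Length fields from a READ CDB might be sketched as follows. The sketch assumes the 10-byte READ(10) layout from the SCSI standards (operation code 0x28, a 4-byte big-endian logical block address, and a 2-byte transfer length) and a 512-byte logical block; neither the CDB variant nor the block size is specified in this description, and the names used are hypothetical.

```python
import struct
from typing import NamedTuple

READ_10 = 0x28    # READ(10) operation code per the SCSI standards
BLOCK_SIZE = 512  # assumed logical block size in bytes


class ReadCdb(NamedTuple):
    logical_block_address: int
    transfer_length_blocks: int


def parse_read10(cdb: bytes) -> ReadCdb:
    """Decode the starting block and block count from a READ(10) CDB."""
    if len(cdb) != 10:
        raise ValueError("READ(10) CDB must be exactly 10 bytes")
    # Layout: opcode, flags, LBA (4 bytes), group, length (2 bytes), control.
    opcode, _flags, lba, _group, length, _control = struct.unpack(">BBIBHB", cdb)
    if opcode != READ_10:
        raise ValueError("not a READ(10) CDB")
    return ReadCdb(lba, length)


def byte_count(cdb: ReadCdb) -> int:
    """Size of the requested data segment in bytes."""
    return cdb.transfer_length_blocks * BLOCK_SIZE
```

With these assumptions, a Transfer Length of 128 blocks corresponds to a 64 KB data segment.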

The requested data is communicated from the target to the initiator using one or more iSCSI PDUs. As shown at steps 624, 626 and 628, the requested data in the illustrated example is transmitted from the target to the initiator using three iSCSI Data-in PDUs. The SCSI Data-in PDU comprises a Buffer Offset field that contains the offset of the PDU payload data within the complete data transfer. Thus, the starting point of each SCSI Data-in PDU can be determined by adding the Buffer Offset of the SCSI Data-in PDU to the Logical Block Address field of the READ command PDU.

At step 630, the target transmits a SCSI response PDU to indicate that the READ command is complete.

FIG. 8 illustrates an example of the use of the iSCSI protocol to perform a WRITE operation. As shown at step 640, the initiator communicates a “WRITE” SCSI command PDU to the target. The “WRITE” command PDU comprises fields that contain the following information:

-   LUN: The logical unit number of the SCSI device to write to;
-   Expected Data Transfer Length: The number of bytes of data the initiator is expected to transfer to the target; and
-   SCSI Command Descriptor Block (CDB): In this example, the CDB corresponds to the SCSI command descriptor block for the WRITE command.

The SCSI CDB for the WRITE command requests that the target write data transferred by the initiator to the write medium. Within the WRITE CDB, the Logical Block Address field specifies the logical block at which the write operation begins. The Transfer Length field specifies the number of contiguous logical blocks of data to be transferred. Using the Logical Block Address field and the Transfer Length field, the target knows where to write data and the size of the data segment to write. At step 642, the target processes old commands before starting the new WRITE command.

When an initiator has submitted a SCSI Command with data that passes from the initiator to the target (WRITE), the target may specify which blocks of data it is ready to receive. The target may request that the data blocks be delivered in whichever order is convenient for the target at that particular instant. This information is passed from the target to the initiator in the Ready to Transfer (R2T) PDU. This transmission is shown at step 644, where the target communicates to the initiator that it is ready to process the first data block corresponding to the WRITE command. The R2T PDU comprises fields that contain the following information:

-   LUN: The logical unit number of the SCSI device to write to;
-   Desired Data Transfer Length: The number of bytes of data the target expects to receive from the initiator;
-   Buffer Offset: The offset of the requested data relative to the buffer address from the execute command procedure call; and
-   Target Transfer Tag: The target assigns its own tag to each R2T request that it sends to the initiator. This tag can be used by the target to identify the data that the target receives from the initiator. The Target Transfer Tag and LUN are copied in the SCSI Data-Out PDUs and are only used by the target. There is no protocol rule about the Target Transfer Tag except that the value 0xffffffff is reserved.

At step 646, the initiator responds to the R2T PDU by sending a SCSI Data-out PDU. The SCSI Data-out PDU comprises a Buffer Offset field that contains the offset of the PDU payload data within the complete data transfer. Thus, the starting point of each SCSI Data-out PDU can be determined by adding the Buffer Offset of the SCSI Data-out PDU to the Logical Block Address field of the WRITE command PDU.

At steps 648 and 650, the target communicates two more R2T PDUs to the initiator. The initiator responds at steps 652 and 654 with two SCSI Data-out PDUs.

At step 656, the target transmits a SCSI response PDU to indicate that the WRITE command is complete.

FIGS. 9A, 9B, 9C, and 9D illustrate a 64 KB Write command example. In this example, the data associated with a write command is temporarily placed in buffer memory 130 before being written to the target device. If there is a contiguous 64 KB block of memory free in the buffer 130, then the entire block of written data may be placed in a single location in the buffer 130. However, there may not be a single contiguous block of memory in the buffer 130 that is large enough to hold all of the data associated with the write command. In the illustrated example, the data associated with a single write command is placed into three different locations in the buffer memory 130. As shown in FIG. 9A, 16 KBytes of data are placed at buffer locations 0x54000-0x57fff, 32 KBytes of data are placed at buffer locations 0x68000-0x6ffff, and 16 KBytes of data are placed at buffer locations 0x80000-0x83fff. These memory locations are represented as the non-hatched portions of the buffer shown in FIG. 9A.

The table shown in FIG. 9B summarizes data for three example R2T PDUs that correspond to the use of the buffer memory identified in FIG. 9A. A target device uses the Desired Data Transfer Length and Buffer Offset fields of an R2T PDU to signify which data the target is ready to receive. The Buffer Offset represents the offset within the complete data transfer. The Desired Data Transfer Length represents the size of the data transfer authorized by a specific R2T PDU. Thus, as illustrated in FIG. 9B, PDU #1 corresponds to an R2T PDU for buffer memory locations 0x54000-0x57fff (0x4000 bytes long with an offset of 0x0000 bytes within the complete data transfer); PDU #2 corresponds to an R2T PDU for buffer memory locations 0x68000-0x6ffff (0x8000 bytes long with an offset of 0x4000 bytes within the complete data transfer); and PDU #3 corresponds to an R2T PDU for buffer memory locations 0x80000-0x83fff (0x4000 bytes long with an offset of 0xc000 bytes within the complete data transfer). One of ordinary skill will also understand that multiple R2T PDUs may be used to place data in a single contiguous block of buffer memory 130, or that an index table may be used to allow a single R2T PDU to correspond to more than one contiguous block of buffer memory 130.
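
The relationship between the free buffer regions and the R2T fields described above can be sketched, by way of example only, as a running sum over region lengths. The function name `plan_r2ts` and the dictionary keys are hypothetical; the region addresses and lengths are those recited for FIG. 9A.

```python
from typing import Dict, List, Tuple


def plan_r2ts(regions: List[Tuple[int, int]]) -> List[Dict[str, int]]:
    """Derive one R2T per free region, given (buffer address, length) pairs.

    The Buffer Offset of each R2T is the total length of the regions
    that precede it in the complete data transfer.
    """
    plan = []
    offset = 0
    for address, length in regions:
        plan.append({
            "buffer_address": address,
            "buffer_offset": offset,
            "desired_data_transfer_length": length,
        })
        offset += length
    return plan


# The three regions recited for FIG. 9A.
FIG_9A_REGIONS = [(0x54000, 0x4000), (0x68000, 0x8000), (0x80000, 0x4000)]
```

Applied to the three regions above, the sketch yields Buffer Offsets of 0x0000, 0x4000, and 0xc000, and the lengths sum to 0x10000 bytes (64 KB).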

Each R2T PDU includes a Target Transfer Tag (TTT). The target generates a Target Transfer Tag for each R2T PDU that it sends to the initiator. By giving each R2T PDU a unique Target Transfer Tag, the target can determine which incoming data corresponds to each R2T PDU.

The target knows the Logical Block Address field and the Buffer Offset for each SCSI Data-out PDU that the target receives. Thus, the target has sufficient information to perform a WRITE command without using the Target Transfer Tag other than for identification purposes.

In addition to providing identification, the Target Transfer Tag may also be advantageously configured to assist in efficient processing of a Data Out PDU. For example, the Target Transfer Tag may be configured to act as a pointer for providing direct memory access to the buffer 130. In one embodiment, the target configures some or all of the bits in the Target Transfer Tag to be a direct pointer into the buffer 130. When the target receives a corresponding Data Out PDU, the Target Transfer Tag provides a pointer for placing the associated data. This method has the advantage of quick address processing, but providing direct access to memory may make it more difficult to prevent security breaches.

Alternatively, some or all of the 32 bits of the Target Transfer Tag may be used as an index to a Data Pointer Table. FIG. 9C shows a Data Pointer Table where the index corresponds to a portion of the Target Transfer Tag from FIG. 9B. In this example, Target Transfer Tag bits 16-31 provide the index into the Data Pointer Table. Thus, a Target Transfer Tag having a value of 0x0010_0123 indicates a Data Pointer Table offset of 0x0010, a Target Transfer Tag having a value of 0x0011_0123 indicates a Data Pointer Table offset of 0x0011, and a Target Transfer Tag having a value of 0x0012_0123 indicates a Data Pointer Table offset of 0x0012.
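
As one possible illustration of the bit layout just described, the Data Pointer Table index can be obtained with a shift and a mask. The function names are hypothetical, and the sketch simply returns the low 16 bits as an opaque value.

```python
from typing import Tuple


def split_ttt(ttt: int) -> Tuple[int, int]:
    """Split a 32-bit Target Transfer Tag into (table index, low 16 bits)."""
    if not 0 <= ttt <= 0xFFFFFFFF:
        raise ValueError("Target Transfer Tag must fit in 32 bits")
    index = (ttt >> 16) & 0xFFFF  # bits 16-31: Data Pointer Table index
    low = ttt & 0xFFFF            # bits 0-15: left uninterpreted here
    return index, low


def make_ttt(index: int, low: int) -> int:
    """Compose a Target Transfer Tag from its two 16-bit halves."""
    if not (0 <= index <= 0xFFFF and 0 <= low <= 0xFFFF):
        raise ValueError("both halves must fit in 16 bits")
    return (index << 16) | low
```

For instance, a tag built from an index of 0x0011 and a low half of 0x0123 has the value 0x00110123 and splits back into the same two halves.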

The Data Pointer Table may include information such as a pointer to an address in buffer memory 130, the Desired Data Transfer Length from the R2T PDU, the Buffer Offset from the R2T PDU, information about the socket ID, or a pointer to the target device.

The Data Pointer Table entries are initialized, for example, when an R2T PDU is sent to the initiator. FIG. 9C shows an exemplary embodiment of a Data Pointer Table. The entries include a pointer to the location in buffer memory 130 that corresponds with the R2T PDU, a field that corresponds to the Desired Data Transfer Length field of the R2T PDU, and a field that corresponds to the Buffer Offset field of the R2T PDU.

The information in the Data Pointer Table allows for calculation of the destination address in buffer memory 130 for data within the Data Out PDU. When a Data Out PDU is received, the offset in the R2T PDU Buffer Offset field is subtracted from the Data Out PDU Buffer Offset, and the result is added to the pointer in the Data Pointer Table to obtain the buffer memory address.
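
The address calculation described above may be sketched as follows, purely as an example. The class name `DataPointerEntry`, its field names, and the Data Out offset used in the usage note are hypothetical; the pointer, length, and R2T offset values are those recited for the second R2T PDU of FIG. 9B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPointerEntry:
    pointer: int         # address in buffer memory for the start of the R2T range
    desired_length: int  # Desired Data Transfer Length from the R2T PDU
    r2t_offset: int      # Buffer Offset from the R2T PDU


def destination_address(entry: DataPointerEntry, data_out_offset: int) -> int:
    """Buffer address = table pointer + (Data Out offset - R2T offset)."""
    return entry.pointer + (data_out_offset - entry.r2t_offset)


# Entry corresponding to the second R2T PDU recited for FIG. 9B.
R2T_2 = DataPointerEntry(pointer=0x68000, desired_length=0x8000, r2t_offset=0x4000)
```

Under these values, a Data Out PDU carrying a Buffer Offset of 0x6000 would be placed at 0x68000 + (0x6000 - 0x4000) = 0x6a000.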

The table shown in FIG. 9D summarizes several exemplary Data Out PDUs that correspond to the R2T PDUs described with respect to FIGS. 9A, 9B, and 9C. Data Out PDUs 1 and 2 are transmitted in response to R2T PDU 1; Data Out PDUs 3, 4, 5 and 6 are transmitted in response to R2T PDU 2; and Data Out PDUs 7 and 8 are transmitted in response to R2T PDU 3. As illustrated, the Target Transfer Tags of the Data Out PDUs correspond with their respective R2T PDUs. The Data Out PDU Buffer Offsets correspond with the offset of the PDU payload data within the complete data transfer. The Data Out PDU Data Segment Length is the data payload length of the Data Out PDU. Using this information in combination with the Data Pointer Table allows for direct memory access into the buffer memory 130. Advantageously, the direct memory access reduces the overhead associated with moving data from location to location.

The Target Transfer Tag may also be used to verify that the Data Out PDU is coming from a recognized connection. For example, all or a portion of the Target Transfer Tag may correspond to an iSCSI socket ID. The iSCSI Socket ID may be determined from the iSCSI PDU, for example, by using a content addressable memory (CAM). In one embodiment, the search data for the CAM comprises the 32-bit IP source address, the 32-bit IP destination address, the 16-bit TCP source port, and the 16-bit TCP destination port. This allows for an arbitrary number of TCP destination port matches. The search result is the Socket ID. In another embodiment, the search data for the CAM comprises the 32-bit IP source address, the 32-bit IP destination address, and either the 16-bit TCP source port or the 16-bit TCP destination port. An additional bit is used to indicate target or initiator mode, which controls whether the TCP source port or the TCP destination port is used in the CAM lookup. In the example illustrated in FIG. 9B, bits 15:0 of the Target Transfer Tag correspond to the socket ID.
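
A software stand-in for the first CAM embodiment described above might look like the following sketch, in which a dictionary keyed on the 96-bit search data plays the role of the CAM. The class name `SocketCam` and its methods are hypothetical, and an actual CAM would perform the match in hardware.

```python
import ipaddress
from typing import Dict, Optional, Tuple

Key = Tuple[int, int, int, int]  # (src IP, dst IP, src port, dst port)


class SocketCam:
    """Dictionary-based model of the CAM: search data -> Socket ID."""

    def __init__(self) -> None:
        self._entries: Dict[Key, int] = {}

    @staticmethod
    def _key(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> Key:
        for port in (src_port, dst_port):
            if not 0 <= port <= 0xFFFF:
                raise ValueError("TCP ports are 16-bit values")
        # IPv4Address converts dotted-quad text to its 32-bit integer form.
        return (
            int(ipaddress.IPv4Address(src_ip)),
            int(ipaddress.IPv4Address(dst_ip)),
            src_port,
            dst_port,
        )

    def add(self, src_ip: str, dst_ip: str, src_port: int, dst_port: int,
            socket_id: int) -> None:
        """Register a connection and the Socket ID assigned to it."""
        self._entries[self._key(src_ip, dst_ip, src_port, dst_port)] = socket_id

    def lookup(self, src_ip: str, dst_ip: str, src_port: int,
               dst_port: int) -> Optional[int]:
        """Return the Socket ID for a connection, or None if unrecognized."""
        return self._entries.get(self._key(src_ip, dst_ip, src_port, dst_port))
```

A lookup that returns no Socket ID corresponds to a PDU arriving from a connection that the device does not recognize.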

In a presently preferred embodiment, the iSCSI Socket ID is compared with information extracted from the iSCSI PDU to verify that the iSCSI PDU has permission to access the memory location associated with the pointer in the lookup table. When the resulting Socket ID from the CAM corresponds with the Socket ID in the lookup table, permission is given to access the memory location listed in the lookup table. An additional check may be used to confirm that the memory offset of the iSCSI PDU is within the allowed memory range.

One embodiment uses information stored in the Data Pointer Table to prevent or limit data corruption. For example, the Data Pointer Table offset encoded in the Target Transfer Tag may be compared with the offsets contained in previously transmitted R2T PDUs. If the Data Pointer Table offset is not consistent with an offset from a previously transmitted R2T PDU, the packet may contain corrupt data.

Storing the Desired Data Transfer Length allows the target to check for accesses to memory locations that are outside of the range provided in the R2T PDU. If the Data Out PDU contains data that is directed to a location in buffer memory 130 that exceeds the range authorized by the R2T PDU, the packet may contain corrupt data. It is to be understood that other range checking methods, such as providing a lower limit and an upper limit in the lookup table, may also be used to prevent accesses to memory locations outside of the range provided in the R2T PDU.
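
The Socket ID comparison and the range check described above could be combined, in one illustrative sketch, as shown below. The class name `TableEntry`, the function name `data_out_permitted`, and the specific field names are hypothetical and are chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    socket_id: int       # Socket ID recorded when the R2T PDU was sent
    r2t_offset: int      # Buffer Offset from the R2T PDU
    desired_length: int  # Desired Data Transfer Length from the R2T PDU


def data_out_permitted(entry: TableEntry, cam_socket_id: int,
                       data_out_offset: int, data_length: int) -> bool:
    """Return True only if the Data Out PDU passes both checks."""
    # Check 1: the PDU arrived on the connection that was sent the R2T.
    if cam_socket_id != entry.socket_id:
        return False
    # Check 2: the payload lies wholly inside the range the R2T authorized.
    start = data_out_offset - entry.r2t_offset
    return start >= 0 and start + data_length <= entry.desired_length
```

A PDU that fails either check in this sketch would be treated as suspect rather than written to buffer memory.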

When a packet is suspected to contain corrupt information, the packet can be ignored, or further analysis may be performed to possibly correct the corruption. Corrupt data may indicate a malicious attempt to sabotage data, and further security measures may be taken to limit access.

The specific embodiments described herein are merely illustrative. Although described in terms of certain preferred embodiments, other embodiments that are apparent to those of ordinary skill in the art, including embodiments which do not provide all of the benefits and features set forth herein, are also within the scope of this invention. For example, the foregoing embodiments described herein have focused on providing direct memory access to a buffer memory within a target device. A skilled artisan will appreciate, in light of this disclosure, how to make, use, and practice embodiments of the invention in which direct memory access is provided to a buffer memory within an initiator device. Just as a target device, in one embodiment, transmits a Target Transfer Tag from which a pointer to the target device's buffer memory can be derived, in one embodiment an initiator device can transmit an Initiator Task Tag configured with substantially the same structure and to perform substantially the same function. That is, based on information in the Initiator Task Tag, a target device may have direct access to a location in the buffer memory of the initiator device. Furthermore, it will be readily apparent to a skilled artisan, in light of this disclosure, that an initiator device can be adapted to support all of the structures and features, described herein with reference to target devices, that are configured to allow direct access to a buffer memory, including, for example, a Data Pointer Table, an index into the Data Pointer Table, a connection lookup table, and the like. This disclosure encompasses this and other alternative embodiments that will be appreciated by a skilled artisan in light of this disclosure.

Accordingly, it is to be understood that the patent rights arising hereunder are not to be limited to the specific embodiments or methods described in this specification or illustrated in the drawings, but extend to other arrangements, technology, and methods, now existing or hereinafter arising, which are suitable or sufficient for achieving the purposes and advantages hereof. The claims alone, and no other part of this disclosure, define the scope of the invention.

1. A method of storing data in a directly accessible buffer memory of a storage networking device, the method comprising: receiving storage networking data and first locational data over a network from a remote storage networking device, wherein the storage networking data includes at least one command for at least partially controlling a device attached to a storage network and is transmitted using a protocol adapted for the transmission of storage networking data, and wherein the first locational data is configured to specify at least indirectly a location within a buffer memory of a storage networking device; determining based at least in part on the first locational data, a location within the buffer memory; storing within the buffer memory, at the location determined at least in part by the first locational data, the storage networking data; and transmitting second locational data to a remote storage networking device, wherein the first locational data is substantially the same as the second locational data, such that the storage networking device assigns the location within buffer memory that the storage networking data is stored; wherein determining a location includes generating from the first locational data a pointer into the buffer memory; wherein generating the pointer includes extracting from a part of the first locational data an index into a data pointer table and using the index to extract the pointer from the data pointer table; wherein the part of the first locational data comprising an index is encrypted within the first locational data; and wherein the protocol adapted for the transmission of storage networking data comprises iSCSI and wherein receiving the storage networking data and the first locational data includes receiving the storage networking data and the first locational data within a first Protocol Data Unit and transmitting the second locational data includes transmitting the second locational data in a second Protocol Data Unit.
 2. The method of claim 1, further comprising comparing a sum of an offset from the first Protocol Data Unit and a data length from the first Protocol Data Unit with a data length from the second Protocol Data Unit. 